Models aren’t explicitly programmed - they learn from data!
Classical programming: Rules + Data → Answers
Machine learning: Data + Answers → Rules
Different data types require different models and architectures
Storage price
In 2012, AlexNet marked a breakthrough in deep learning by training the first large-scale convolutional neural network:
Six days of training with 2× GTX 580 3 GB GPUs!
All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available1
Up to 8 GPUs per node, 640 GB of VRAM!
Nvidia DGX H100
Machine learning can be categorized into three main paradigms:
The model receives input-output pairs \((X, Y)\) and learns to map \(X\) to \(Y\).
We will talk about:
In parallel!
Neuron
\[ y = \sigma ( \sum_{i=1}^{n} w_ix_i+b)\]
\(n\) is the input dimensionality, \(b\) is a constant bias term, \(w_i\) are the weights for each input, and \(\sigma\) is the activation function.
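A minimal sketch of this single neuron in plain Python, assuming sigmoid as one common choice for \(\sigma\) (the input values and weights are made up for illustration):

```python
import math

def neuron(x, w, b):
    """Single neuron: weighted sum of inputs plus bias,
    passed through a sigmoid activation (one choice for sigma)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

y = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)  # z = 0, so y = 0.5
```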
\[y_j = \sigma ( \sum_{i=1}^{n} w_{ij}x_i+b_j)\]
\(n\) is the input dimensionality. \(m\) is the number of neurons in a layer.
\[y_j = \sigma ( \sum_{i=1}^{n} w_{ij}x_i+b_j)\]
But what if we process many inputs at once?
Then, we can perform matrix-matrix multiplication, which increases arithmetic intensity - more computations per memory access.
This is why we use accelerators like GPUs
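To make the batched case concrete, here is a naive sketch: stacking a batch of inputs as the rows of a matrix \(X\) turns \(m\) matrix-vector products into a single matrix-matrix product \(XW\) (the numbers below are illustrative only):

```python
def matmul(A, B):
    """Naive matrix-matrix product: C[i][j] = sum_k A[i][k] * B[k][j]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Batch of 3 inputs (rows of X), layer of 2 neurons (columns of W).
X = [[1.0, 2.0],
     [0.0, 1.0],
     [3.0, -1.0]]
W = [[0.5, -0.5],
     [0.25, 1.0]]
Y = matmul(X, W)  # shape (3, 2): one matmul instead of 3 matrix-vector products
```

On a GPU, this single large multiplication reuses each loaded weight across the whole batch, which is exactly the higher arithmetic intensity mentioned above.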
To allow neural networks to learn complex, non-linear patterns, we use activation functions between layers.
Multi-layer perceptrons (MLPs) are the simplest form of neural network
A feedforward neural network with at least one hidden layer can approximate any continuous function to any desired accuracy, given enough neurons and the right activation function.
What this means in practice:
Training a model means:
Finding the best parameters \(w\) and \(b\) to solve the task, just by looking at data samples.
Core elements needed to train a neural network:
Usually we use some of the data to train the model and the remaining points to assess its performance. We call these partitions the train set and the test set.
Datasets are typically too large to fit entirely in memory.
To handle this, we split the data into batches - small subsets of the dataset processed one at a time.
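A minimal sketch of this batching step, assuming the dataset fits in a Python list (real frameworks stream batches from disk):

```python
def batches(data, batch_size):
    """Yield successive mini-batches (small subsets) of the dataset."""
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

dataset = list(range(10))
chunks = list(batches(dataset, batch_size=4))
# chunks == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```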
Usually a scalar-valued function that quantifies how well the model’s predictions match the true labels.
Regression tasks:
Mean Squared Error (MSE, L2 norm)
Mean Absolute Error (MAE, L1 norm)
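A quick sketch of both regression losses on made-up predictions:

```python
def mse(y_true, y_pred):
    """Mean squared error (squared L2 norm of the residual, averaged)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error (L1 norm of the residual, averaged)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
loss_mse = mse(y_true, y_pred)  # (0.25 + 0 + 1) / 3
loss_mae = mae(y_true, y_pred)  # (0.5 + 0 + 1) / 3 = 0.5
```

Note how MSE penalizes the large residual (1.0) more heavily than MAE does, which is why the choice between them matters for outlier-heavy data.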
Classification tasks:
To train a neural network, we need to compute how the loss changes with respect to each weight.
Way to go: the backpropagation algorithm1.
It gives us:
\[ \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \]
where \(L\) is the loss, \(y_j\) is the output of neuron \(j\), and \(w_{ij}\) is the weight connecting input \(i\) to neuron \(j\).
Source doi: 10.1109/UEMCON51285.2020.9298092
We use the gradient to update the weights iteratively: \[ w_{ij} \leftarrow w_{ij} - \eta \cdot \frac{\partial L}{\partial w_{ij}} \]
where \(\eta\) is the learning rate.
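The update rule above, sketched on a deliberately tiny toy model \(y = wx\) with a single training sample and MSE loss (the sample values and learning rate are illustrative assumptions):

```python
def sgd_step(w, grad, lr):
    """One gradient-descent update: w <- w - eta * dL/dw."""
    return w - lr * grad

# Toy example: fit y = w * x to one sample (x=2, y=4), L = (w*x - y)^2.
x, y = 2.0, 4.0
w = 0.0
for _ in range(50):
    grad = 2 * x * (w * x - y)   # dL/dw, computed by hand for this toy loss
    w = sgd_step(w, grad, lr=0.1)
# w converges toward the true value 2.0
```

In a real network, backpropagation supplies the gradient for every weight; the update step itself is exactly this simple.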
We as humans need to set some hyperparameters before training:
These are just concepts; there is no standard for their names!
Theoretical speedup: \(N\) times faster with \(N\) GPUs.
Each GPU gets a slice of the data and computes the gradients independently.
After each batch, gradients are averaged across GPUs.
Memory constraints: each GPU must fit the entire model.
Communication overhead: synchronizing gradients can be costly.
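A toy simulation of this data-parallel scheme, again on the model \(y = wx\): each "GPU" computes its gradient on its own data shard, then the gradients are averaged, as an all-reduce would do in practice (the shards and values are made up):

```python
# Simulated data parallelism: each "GPU" holds a shard of the batch.
def local_grad(w, shard):
    """Mean gradient of the MSE loss for y = w * x on one data shard."""
    return sum(2 * x * (w * x - y) for x, y in shard) / len(shard)

shards = [[(1.0, 2.0), (2.0, 4.0)],   # data on GPU 0
          [(3.0, 6.0), (4.0, 8.0)]]   # data on GPU 1
w = 0.0
grads = [local_grad(w, s) for s in shards]  # computed independently per GPU
avg_grad = sum(grads) / len(grads)          # gradient averaging (all-reduce)
```

Real frameworks (e.g. PyTorch DistributedDataParallel) overlap this averaging step with the backward pass to hide the communication cost noted above.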
Split the model into stages, each stage runs on a different GPU.
But running multiple batches in parallel avoids idle time!
Tensor parallelism: source huggingface.co
This form of parallelism has influenced the design of both hardware and software systems:
NVIDIA DGX H100 Topology
Built on top of this advanced hardware are highly optimized software libraries designed to minimize communication overhead and maximize performance.
From 0 to distributed training